Menu

BERT End to End (Fine-tuning + Predicting) in 5 minutes with Cloud TPU

Overview

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

This Colab demonstrates using a free Colab Cloud TPU to fine-tune sentence and sentence-pair classification tasks built on top of pretrained BERT models, and to run predictions on the tuned model. It demonstrates loading pretrained BERT models from both TF Hub and checkpoints.

Note: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the Google Cloud TPU quickstart to create a GCP account and a GCS bucket. New accounts receive $300 in free credit to get started with any GCP product. You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

This notebook is hosted on GitHub. To view it in its original repository, after opening the notebook, select File > View on GitHub.

Learning objectives

In this notebook, you will learn how to train and evaluate a BERT model using TPU.

  Train on TPU

  1. Create a Cloud Storage bucket for your TensorBoard logs at https://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.

  2. On the main menu, click Runtime and select Change runtime type. Set "TPU" as the hardware accelerator.

  3. Click Runtime again and select Runtime > Run all (note: the "Colab-only auth for this notebook and the TPU" cell requires user input). You can also run the cells manually with Shift+Enter.

Set up your TPU environment

In this section, you perform the following tasks:

  • Set up a Colab TPU running environment
  • Verify that you are connected to a TPU device
  • Upload your credentials to the TPU to access your GCS bucket
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf
import pandas as pd
from IPython.display import clear_output

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
TPU address is grpc://10.2.26.18:8470
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 10041389355245699286),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 2326484809530175149),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 9870398767955466508),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 1089285905708430896),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 1150489596665938143),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 6148071924904265137),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 13505351804001368339),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 6936778516536281510),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 12547113753853953505),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1282150454679809072),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 3128273371367183462)]

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Import BERT modules

With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from the source. Alternatively, you can install BERT using pip (!pip install bert-tensorflow).

tf.__version__
'1.13.1'
!pip install bert-tensorflow
Requirement already satisfied: bert-tensorflow in /usr/local/lib/python3.6/dist-packages (1.0.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from bert-tensorflow) (1.12.0)
# Alternative imports if you installed BERT via pip (bert-tensorflow):
# import bert
# import tensorflow_hub as hub
# from bert import modeling
# from bert import tokenization
# from bert import optimization
# from bert import run_classifier
# from bert import run_classifier_with_tfhub
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if not 'bert_repo' in sys.path:
  sys.path += ['bert_repo']

# import python modules defined by BERT
import modeling
import optimization
import run_classifier
import run_classifier_with_tfhub
import tokenization

# import tfhub 
import tensorflow_hub as hub
Cloning into 'bert_repo'...
remote: Enumerating objects: 325, done.
remote: Total 325 (delta 0), reused 0 (delta 0), pack-reused 325
Receiving objects: 100% (325/325), 234.65 KiB | 3.45 MiB/s, done.
Resolving deltas: 100% (186/186), done.
WARNING: Logging before flag parsing goes to stderr.
W0518 08:40:33.476649 140048423761792 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14

Prepare dataset

This next section of code performs the following tasks:

  • Specify the task and download training data
  • Specify the BERT pretrained model
  • Specify the GCS bucket; create an output directory for model checkpoints and eval results
TASK = "PTTMovieReviews" #@param {type:"string"}
assert TASK in ["WSDMFakeNews", "PTTMovieReviews"]

RAW_DATA_DIR = "raw_data"
!mkdir -p {RAW_DATA_DIR}

if TASK == "PTTMovieReviews":
    TASK_DATA_DIR = os.path.join(RAW_DATA_DIR, "ptt_movie_review")
    !mkdir -p {TASK_DATA_DIR}
    
    !wget https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/ptt_movie_review_tokenized.zip
    !unzip ptt_movie_review_tokenized.zip -d {TASK_DATA_DIR}
    !rm -rf {TASK_DATA_DIR}/__MACOSX
    !mv {TASK_DATA_DIR}/PPT_Movie_Review_train-1.txt {TASK_DATA_DIR}/train.txt
    !mv {TASK_DATA_DIR}/PPT_Movie_Review_test-1.txt {TASK_DATA_DIR}/test.txt
    
    train = pd.read_csv(f"{TASK_DATA_DIR}/train.txt", header=None, sep="\t")
    train.columns = ["label", "text"]

elif TASK == "WSDMFakeNews":
    TASK_DATA_DIR = os.path.join(RAW_DATA_DIR, "wsdm_fakenews")
    !mkdir -p {TASK_DATA_DIR}
    
    zip_file = "drive-download-20190516T113709Z-001.zip"
    file_url = "https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/fake_news/" + zip_file
    !wget {file_url}
    !unzip {zip_file}

    !mv dev_bert.tsv dev.tsv
    !mv test_bert.tsv test.tsv
    !mv train_bert.tsv train.tsv
    !mv dev.tsv test.tsv train.tsv {TASK_DATA_DIR}

print('***** Task data directory: {} *****'.format(TASK_DATA_DIR))
!ls $TASK_DATA_DIR
--2019-05-18 08:40:36--  https://s3-ap-northeast-1.amazonaws.com/smartnews-dmp-tmp/meng/ptt_movie_review_tokenized.zip
Resolving s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)... 52.219.68.136
Connecting to s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)|52.219.68.136|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8398006 (8.0M) [application/zip]
Saving to: ‘ptt_movie_review_tokenized.zip.1’

ptt_movie_review_to 100%[===================>]   8.01M  6.97MB/s    in 1.1s    

2019-05-18 08:40:38 (6.97 MB/s) - ‘ptt_movie_review_tokenized.zip.1’ saved [8398006/8398006]

Archive:  ptt_movie_review_tokenized.zip
  inflating: raw_data/ptt_movie_review/PPT_Movie_Review_train-1.txt  
   creating: raw_data/ptt_movie_review/__MACOSX/
  inflating: raw_data/ptt_movie_review/__MACOSX/._PPT_Movie_Review_train-1.txt  
  inflating: raw_data/ptt_movie_review/PPT_Movie_Review_test-1.txt  
  inflating: raw_data/ptt_movie_review/__MACOSX/._PPT_Movie_Review_test-1.txt  
***** Task data directory: raw_data/ptt_movie_review *****
test.txt  train.txt

BERT TF Hub modules:

BUCKET = 'tpu-training-result' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'
OUTPUT_DIR = 'gs://{}/bert-tfhub/models/{}'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

# Available pretrained model checkpoints:
#   uncased_L-12_H-768_A-12: uncased BERT base model
#   uncased_L-24_H-1024_A-16: uncased BERT large model
#   cased_L-12_H-768_A-12: cased BERT base model
#   chinese_L-12_H-768_A-12: Chinese BERT base model
BERT_MODEL = 'chinese_L-12_H-768_A-12' #@param {type:"string"}
***** Model output directory: gs://tpu-training-result/bert-tfhub/models/PTTMovieReviews *****

Define tokenizer

Now let's load the tokenizer module from TF Hub and try it out.

BERT_MODEL_HUB = 'https://tfhub.dev/google/bert_' + BERT_MODEL + '/1'

# def create_tokenizer_from_hub_module():
#   """Get the vocab file and casing info from the Hub module."""
#   with tf.Graph().as_default():
#       bert_module = hub.Module(BERT_MODEL_HUB)
#       tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
#       with tf.Session() as sess:
#           vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
#                                             tokenization_info["do_lower_case"]])
      
#   return bert.tokenization.FullTokenizer(
#       vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = run_classifier_with_tfhub.create_tokenizer_from_hub_module(BERT_MODEL_HUB)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
W0518 08:41:01.678604 140048423761792 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0518 08:41:04.145763 140048423761792 saver.py:1483] Saver not created because there are no variables in the graph to restore
tokenizer.tokenize("這是使用 BERT tokenizer 的一個例句")
['這',
 '是',
 '使',
 '用',
 '[UNK]',
 'to',
 '##ken',
 '##ize',
 '##r',
 '的',
 '一',
 '個',
 '例',
 '句']

Define task processors

from run_classifier import DataProcessor, InputExample, InputFeatures
import pandas as pd
class WSDMFakeNewsProcessor(DataProcessor):
    """Processor for WSDM - Fake News Classification Kaggle Competition
    https://www.kaggle.com/c/fake-news-pair-classification-challenge
    """

    def __init__(self):
        self.language = "zh"

    def get_train_examples(self, data_dir):
        df = pd.read_csv(os.path.join(data_dir, "train.tsv"), sep="\t")

        examples = []
        for (i, line) in enumerate(df.itertuples()):
            guid = "train-%d" % (i)
            text_a = tokenization.convert_to_unicode(line[1])
            text_b = tokenization.convert_to_unicode(line[2])
            label = tokenization.convert_to_unicode(line[3])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_dev_examples(self, data_dir):
        df = pd.read_csv(os.path.join(data_dir, "dev.tsv"), sep="\t")
        examples = []
        for (i, line) in enumerate(df.itertuples()):
            guid = "dev-%d" % (i)
            text_a = tokenization.convert_to_unicode(line[1])
            text_b = tokenization.convert_to_unicode(line[2])
            label = tokenization.convert_to_unicode(line[3])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_test_examples(self, data_dir):
        df = pd.read_csv(os.path.join(data_dir, "test.tsv"), sep="\t").fillna('')
        examples = []
        for (i, line) in enumerate(df.itertuples()):
            guid = "test-%d" % (i)
            text_a = tokenization.convert_to_unicode(line[1])
            text_b = tokenization.convert_to_unicode(line[2])
            label = "unrelated"
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples

    def get_labels(self):
        """See base class."""
        return ["unrelated", "agreed", "disagreed"]
class PTTMovieReviewsProcessor(DataProcessor):
    """Processor for PTT Movie Reviews
    """
    
    def __init__(self):
        self.language = "zh"

    def get_train_examples(self, data_dir):
        df = pd.read_csv(os.path.join(data_dir, "train.txt"), header=None, sep="\t")
        
        examples = []
        for (i, line) in enumerate(df.itertuples()):
            guid = "train-%d" % (i)
            text_a = tokenization.convert_to_unicode(line[2])
            label = tokenization.convert_to_unicode(line[1])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

    def get_dev_examples(self, data_dir):
        df = pd.read_csv(os.path.join(data_dir, "test.txt"), header=None, sep="\t")
        examples = []
        for (i, line) in enumerate(df.itertuples()):
            guid = "dev-%d" % (i)
            text_a = tokenization.convert_to_unicode(line[2])
            label = tokenization.convert_to_unicode(line[1])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

    def get_labels(self):
        """See base class."""
        return ["N", "P"]
if TASK == "PTTMovieReviews":
    processor = PTTMovieReviewsProcessor()
elif TASK == "WSDMFakeNews":
    processor = WSDMFakeNewsProcessor()

label_list = processor.get_labels()
print("processor:", processor)
print("label_list:", label_list)
processor: <__main__.PTTMovieReviewsProcessor object at 0x7f5f4fbee898>
label_list: ['N', 'P']

Define hyperparameters

TRAIN_BATCH_SIZE = 256
EVAL_BATCH_SIZE = 8
PREDICT_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 200.0
MAX_SEQ_LENGTH = 256
# Warmup is a period of time where the learning rate
# is small and gradually increases--usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500


# Compute number of train and warmup steps from batch size
train_examples = processor.get_train_examples(TASK_DATA_DIR)
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

# Setup TPU related config
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
NUM_TPU_CORES = 8
ITERATIONS_PER_LOOP = 1000

def get_run_config(output_dir):
  return tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      model_dir=output_dir,
      save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=ITERATIONS_PER_LOOP,
          num_shards=NUM_TPU_CORES,
          per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))
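The warmup comment above can be made concrete. A minimal sketch of the schedule that BERT's create_optimizer implements (linear warmup followed by linear, power-1 polynomial decay to zero); the numbers here are illustrative, not the ones used in this notebook:

```python
def learning_rate_at(step, base_lr=2e-5, num_train_steps=1000, num_warmup_steps=100):
  """Learning rate at a given global step: linear warmup, then linear decay to 0."""
  if step < num_warmup_steps:
    # Warmup: ramp linearly from 0 up to base_lr.
    return base_lr * step / num_warmup_steps
  # Decay: polynomial decay with power=1.0 and end_learning_rate=0.0,
  # evaluated over the full training run.
  return base_lr * (1.0 - step / num_train_steps)

print(learning_rate_at(50))    # mid-warmup: half of base_lr
print(learning_rate_at(1000))  # end of training: 0.0
```

With WARMUP_PROPORTION = 0.1, the first 10% of training steps ramp the rate up, which helps stabilize the early updates of the pretrained weights.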

Fine-tune and Run Predictions on a pretrained BERT Model from TF Hub

This section demonstrates fine-tuning from a pre-trained BERT TF Hub module and running predictions.

OUTPUT_DIR
'gs://tpu-training-result/bert-tfhub/models/PTTMovieReviews'
# Force TF Hub to write its cache to the GCS bucket we provide.
os.environ['TFHUB_CACHE_DIR'] = OUTPUT_DIR

model_fn = run_classifier_with_tfhub.model_fn_builder(
    num_labels=len(label_list),
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=True,
    bert_hub_module_handle=BERT_MODEL_HUB
)

estimator_from_tfhub = tf.contrib.tpu.TPUEstimator(
    use_tpu=True,
    model_fn=model_fn,
    config=get_run_config(OUTPUT_DIR),
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE,
    predict_batch_size=PREDICT_BATCH_SIZE,
)

At this point, you can now fine-tune the model, evaluate it, and run predictions on it.

# Train the model
def model_train(estimator):
  # Sequences are truncated/padded to at most MAX_SEQ_LENGTH tokens.
  train_features = run_classifier.convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started training at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(train_examples)))
  print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
  tf.logging.info("  Num steps = %d", num_train_steps)
  train_input_fn = run_classifier.input_fn_builder(
      features=train_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=True,
      drop_remainder=True)
  estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  print('***** Finished training at {} *****'.format(datetime.datetime.now()))

FakeNews dataset size: 288,512 examples

  • TPU: 1 epoch ≈ 1,024 steps (wall time: 18 min 8 s)
  • GPU: 1 epoch ≈ 1.2 s/step × 9,016 steps (batch size 32) ≈ 180 min
  • CPU: 1 epoch ≈ 45 s/step × 9,016 steps (batch size 32)
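The step counts in these notes follow directly from dataset size and batch size; a quick sanity check:

```python
dataset_size = 288512       # FakeNews training examples
gpu_batch_size = 32

# Steps per epoch = examples / batch size (partial batches dropped).
gpu_steps_per_epoch = dataset_size // gpu_batch_size
print(gpu_steps_per_epoch)  # 9016

# Epoch wall time on GPU at ~1.2 seconds per step.
gpu_minutes = gpu_steps_per_epoch * 1.2 / 60
print(round(gpu_minutes))   # 180
```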
# %%time
# model_train(estimator_from_tfhub)
def model_eval(estimator):
  # Eval the model.
  eval_examples = processor.get_dev_examples(TASK_DATA_DIR)
  eval_features = run_classifier.convert_examples_to_features(
      eval_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  print('***** Started evaluation at {} *****'.format(datetime.datetime.now()))
  print('  Num examples = {}'.format(len(eval_examples)))
  print('  Batch size = {}'.format(EVAL_BATCH_SIZE))

  # Eval will be slightly WRONG on the TPU because it will truncate
  # the last batch.
  eval_steps = int(len(eval_examples) / EVAL_BATCH_SIZE)
  eval_input_fn = run_classifier.input_fn_builder(
      features=eval_features,
      seq_length=MAX_SEQ_LENGTH,
      is_training=False,
      drop_remainder=True)
  result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
  print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))
  output_eval_file = os.path.join(OUTPUT_DIR, "eval_results.txt")
  with tf.gfile.GFile(output_eval_file, "w") as writer:
    print("***** Eval results *****")
    for key in sorted(result.keys()):
      print('  {} = {}'.format(key, str(result[key])))
      writer.write("%s = %s\n" % (key, str(result[key])))
# model_eval(estimator_from_tfhub)
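To see how much the truncation mentioned in model_eval actually costs, a small sketch (the example count here is hypothetical):

```python
def truncated_examples(num_examples, batch_size):
  # With drop_remainder=True, only num_examples // batch_size full
  # batches are evaluated; the final partial batch is discarded,
  # so at most batch_size - 1 examples are skipped.
  eval_steps = num_examples // batch_size
  return num_examples - eval_steps * batch_size

print(truncated_examples(1003, 8))  # 3 examples skipped
```

For a reasonably sized dev set this bias is negligible, which is why the comment calls the result only "slightly" wrong.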
PREDICT_BATCH_SIZE
8
def model_predict(estimator):
  # Make predictions on a subset of eval examples
  prediction_examples = processor.get_dev_examples(TASK_DATA_DIR)[:PREDICT_BATCH_SIZE]
  input_features = run_classifier.convert_examples_to_features(prediction_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=True)
  predictions = estimator.predict(predict_input_fn)

  for example, prediction in zip(prediction_examples, predictions):
    print('text_a: %s\ntext_b: %s\nlabel:%s\nprediction:%s\n' % (example.text_a, example.text_b, str(example.label), prediction['probabilities']))
# model_predict(estimator_from_tfhub) 
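Each prediction's 'probabilities' array is ordered the same way as label_list, so the predicted class is simply the argmax. A sketch with made-up probabilities:

```python
label_list = ["N", "P"]       # as returned by processor.get_labels()
probabilities = [0.13, 0.87]  # hypothetical model output for one example

# Index of the highest-probability class maps back into label_list.
best = max(range(len(label_list)), key=lambda i: probabilities[i])
print(label_list[best])  # P
```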

Visualization

from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'general-186304'
!gcloud config set project {project_id}
Updated property [core/project].


# Download the file from a given Google Cloud Storage bucket.
!mkdir models
!mkdir models/{TASK}
!gsutil cp -r gs://{BUCKET}/bert-tfhub/models/{TASK} models/
  
# # Print the result to make sure the transfer worked.
# !cat /tmp/gsutil_download.txt
!pip install pytorch-pretrained-bert
from google.colab import files

uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving bert_config.json to bert_config.json
!cp bert_config.json models/{TASK}
!cp models/{TASK}/model.ckpt-1768.data-00000-of-00001 models/{TASK}/model.ckpt.data-00000-of-00001
!cp models/{TASK}/model.ckpt-1768.index models/{TASK}/model.ckpt.index
!cp models/{TASK}/model.ckpt-1768.meta models/{TASK}/model.ckpt.meta
BERT_BASE_DIR=f"models/{TASK}"

!pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
  {BERT_BASE_DIR}/model.ckpt \
  {BERT_BASE_DIR}/bert_config.json \
  {BERT_BASE_DIR}/pytorch_model.bin

clear_output()
%load_ext autoreload
%autoreload 2
import sys

!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists; removing it for a clean pull"
!rm -rf bertviz_repo  # delete any existing copy so the clone below gets the latest version
!test -d bertviz_repo || git clone https://github.com/leemengtaiwan/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
  sys.path += ['bertviz_repo']
rm: cannot remove 'bertviz_repo': No such file or directory
Cloning into 'bertviz_repo'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 565 (delta 60), reused 0 (delta 0), pack-reused 485
Receiving objects: 100% (565/565), 37.03 MiB | 20.71 MiB/s, done.
Resolving deltas: 100% (354/354), done.
from bertviz import attention, visualization
from bertviz.pytorch_pretrained_bert import BertModel, BertTokenizer
!ls /usr/local/share/jupyter/nbextensions/google.colab/
colabwidgets  files.js	tabbar.css  tabbar_main.min.js
!find / -name require.js
/usr/local/lib/python3.6/dist-packages/notebook/static/components/requirejs/require.js
/usr/local/lib/python2.7/dist-packages/notebook/static/components/requirejs/require.js
^C
from google.colab import files

files.download('/usr/local/lib/python3.6/dist-packages/notebook/static/components/requirejs/require.js')
%%javascript
require.config({
  paths: {
      d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min'
  }
});
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))
# TASK = "PTTMovieReviews"
from IPython.display import clear_output
bert_version = f'models/{TASK}'
model = BertModel.from_pretrained(bert_version, from_tf=True)
clear_output()
from google.colab import files

uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
!cp vocab.txt models/{TASK}
tokenizer = BertTokenizer.from_pretrained(bert_version)
sentence_a = "老爷爷自用的咽炎偏方,早上喝这个,3天见效,治一个好一个!"
sentence_b = "咽炎最佳治疗方法 这些小偏方治疗咽炎超管用"
attention_visualizer = visualization.AttentionVisualizer(model, tokenizer)
tokens_a, tokens_b, attn = attention_visualizer.get_viz_data(sentence_a, sentence_b)
call_html()
attention.show(tokens_a, tokens_b, attn)
[Interactive attention visualization (layer/attention selectors) renders here.]

Fine-tune and run predictions on a pre-trained BERT model from checkpoints

Alternatively, you can also load pre-trained BERT models from saved checkpoints.

# # Setup task specific model and TPU running config.
# BERT_PRETRAINED_DIR = 'gs://cloud-tpu-checkpoints/bert/' + BERT_MODEL 
# print('***** BERT pretrained directory: {} *****'.format(BERT_PRETRAINED_DIR))
# !gsutil ls $BERT_PRETRAINED_DIR

# CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
# INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')

# model_fn = run_classifier.model_fn_builder(
#   bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
#   num_labels=len(label_list),
#   init_checkpoint=INIT_CHECKPOINT,
#   learning_rate=LEARNING_RATE,
#   num_train_steps=num_train_steps,
#   num_warmup_steps=num_warmup_steps,
#   use_tpu=True,
#   use_one_hot_embeddings=True
# )

# OUTPUT_DIR = OUTPUT_DIR.replace('bert-tfhub', 'bert-checkpoints')
# tf.gfile.MakeDirs(OUTPUT_DIR)

# estimator_from_checkpoints = tf.contrib.tpu.TPUEstimator(
#   use_tpu=True,
#   model_fn=model_fn,
#   config=get_run_config(OUTPUT_DIR),
#   train_batch_size=TRAIN_BATCH_SIZE,
#   eval_batch_size=EVAL_BATCH_SIZE,
#   predict_batch_size=PREDICT_BATCH_SIZE,
# )

Now, you can repeat the training, evaluation, and prediction steps.

# model_train(estimator_from_checkpoints)
# model_eval(estimator_from_checkpoints)
# model_predict(estimator_from_checkpoints)

What's next

  • Learn about Cloud TPUs that Google designed and optimized specifically to speed up and scale up ML workloads for training and inference and to enable ML engineers and researchers to iterate more quickly.
  • Explore the range of Cloud TPU tutorials and Colabs to find other examples that can be used when implementing your ML project.

On Google Cloud Platform, in addition to the GPUs and TPUs available on pre-configured deep learning VMs, you will find AutoML (beta) for training custom models without writing code, and Cloud ML Engine, which lets you run parallel training jobs and hyperparameter tuning of your custom models on powerful distributed hardware.
